Abstract

Good term selection is an important issue for an automatic query expansion
(AQE) technique. AQE techniques that select expansion terms from the target
corpus usually do so in one of two ways. Distribution-based term selection
compares the distribution of a term in the (pseudo) relevant documents with
its distribution in the whole corpus or with a random distribution. Two
well-known distribution-based methods are based on Kullback-Leibler
Divergence (KLD) [8] and Bose-Einstein statistics (Bol) [1]. Association-based
term selection, on the other hand, uses information about how a candidate
term co-occurs with the original query terms. Local Context Analysis (LCA)
[31] and the Relevance-based Language Model (RM3) [15] are examples of
association-based methods. Our goal in this study is to investigate how these
two classes of methods may be combined to improve retrieval effectiveness. We
propose the following combination-based approach. Candidate expansion terms
are first obtained using a distribution-based method. This set is then
refined based on the strength of the association of terms with the original
query terms. We test our methods on 11 TREC collections. The proposed
combinations generally yield better results than each individual method, as
well as other state-of-the-art AQE approaches. En route to our primary goal,
we also propose some modifications to LCA and Bol which lead to improved
performance.



1 Introduction 

Consider a user's query Q and a relevant document D from a document col- 
lection. Q and D may use different vocabulary to refer to the same concept. 
Information Retrieval (IR) systems that rely solely on keyword-matching may 
not detect a match between Q and D, and may therefore not retrieve D in 
response to Q. This is the well-known vocabulary mismatch problem in IR. 

A good retrieval system must solve this problem by bridging the vocabulary
gap that exists between useful documents and the user's query. Query
Expansion (QE) is an important technique that attempts to increase the
likelihood of a match between the query and relevant documents by adding
related terms (called expansion terms) to a user's query.

A wide variety of methods for Automatic Query Expansion (AQE) have
been proposed over the last 15-20 years. These methods find related terms
from different sources such as the target corpus, linguistic resources like
WordNet [H], thesauri [55], ontologies [5], the World Wide Web, Wikipedia [H]
and query logs [12]. A recent survey of such techniques can be found in [9].
Of all these techniques, methods that use the target corpus as a source of
expansion terms are among the most widely used because they are simple and
require no additional resources.

Target-corpus-based AQE techniques can be broadly classified into two 
groups: distribution based and association based. Distribution based methods 
select terms by comparing the distribution of a term in the (pseudo) relevant 
documents with its distribution in the whole corpus. Broadly, such methods 
select terms that are more likely to occur in the (pseudo) relevant documents 
than in a document chosen randomly from the entire corpus. On the other hand, 
association based methods select expansion terms on the basis of their associ- 
ation (or co-occurrence) with all query terms. A term that tends to co-occur 
with all / many of the query terms is regarded as a good expansion term. 

While a number of distribution / association-based QE techniques have been 
shown to be effective on average (i.e. when their overall performance across a 
large set of queries is measured), the impact of different QE techniques on 
individual queries can vary greatly. 





             Baseline   Assoc. based      Distr. based
MAP          0.218      0.250 (+14.8%)    0.257 (+18.0%)
Better on               81 queries        91 queries

Table 1: Potential improvement obtainable in principle by judiciously choosing QE techniques

Table 1 shows the Mean Average Precision (MAP) scores for three retrieval
methods on TREC queries 301-450 (for more details, please see Section 4):
a baseline strategy that uses original, unexpanded queries, and representative
distribution-based [5] and association-based [31] QE methods. The QE methods
are superior to the baseline on average, but they result in decreased
performance for a number of queries. Further, while the overall performance
figures for these two QE methods are comparable, each of these methods
outperforms the other on about half the queries used in this experiment.

As these two methods work in different ways, our hypothesis is that if we
combine them by considering both distribution information and association
information, we should be able to improve overall performance. In this
study, therefore, we investigate the possibility of improving retrieval
effectiveness by combining association- and distribution-based QE approaches.
We first select two well-known, representative methods from each category,
viz. LCA [31] and RM3 [15] (association-based), and KLD [8] and Bol [1]
(distribution-based). Next, we introduce some simple modifications in the
basic formulae of some of these methods in order to improve their
performance. We verify that these modifications indeed result in better
retrieval effectiveness. Finally, the two approaches are combined as follows:
we select a relatively large number of candidate expansion terms using the
distribution-based method. Some of these are filtered out using information
from the association-based method. The refined set is finally used for query
expansion.

We test our combined method on eleven TREC collections. Our proposed
method yields significant improvements on all collections over a baseline that 
uses the original, unexpanded queries. More importantly, the combined methods 
yield improvements over the individual AQE methods for most of the collections. 

In summary, this study makes the following contributions. 

• It proposes refinements for some well-known QE methods. 

• It demonstrates that a combination of distribution based and association 
based methods outperforms the individual methods as well as state-of-the- 
art QE methods, such as the approaches proposed in p~5l [T] . 

In the next section, we discuss the relationship between this study and
related work. Section 3 briefly reviews the existing AQE methods that are
used in this study, our modifications of these methods, as well as the
proposed method for combining AQE techniques. Section 4 describes the
experimental setting that we used. Results comparing the proposed methods
with existing ones are presented in Section 5. Finally, Section 6 summarizes
some related issues that need to be studied in future work.

2 Related work 

Early work on automatic query expansion dates back to the 1960s. Rocchio's 
relevance feedback method [29] is still used in its original and modified forms 
for AQE. The availability of the TREC collections, and the widespread success 
of AQE on these collections stimulated further research in this area. Carpineto
and Romano [9] provide a recent and comprehensive survey of AQE techniques. 
We focus here on some important AQE techniques that are either distribution- 
or association-based. 

Association-based QE techniques. Early work on association-based AQE
includes "concept-based" QE [26] and phrasefinder [11]. Both methods make
use of term co-occurrence information extracted from a corpus. Local context
analysis (LCA) [31] [30] is another well-known method that also selects
expansion terms based on whether they have a high degree of co-occurrence
with all query terms. However, in LCA, co-occurrence information is obtained
from a set of top-ranked documents retrieved in response to the original
query, rather than the whole target corpus. Relevance-based language models
[17] constitute another, more recent, co-occurrence-based approach. This
method is based on the Language Modeling framework. The query and relevant
documents are all assumed to be generated from an underlying relevance model.
This model is estimated based on (only) the pseudo relevant documents for a
particular query. This approach was subsequently refined by AbdulJaleel et
al. [15]. The refinement, called RM3, incorporates the original query when
estimating the relevance model. According to a comparative study by Lv et
al., RM3 is the most effective and robust among a number of state-of-the-art
AQE methods. RM3 is frequently used as a baseline against which several
recent QE methods have been compared [23] [20] [3] [6] [16].

Distribution-based QE techniques. As early as 1978, Doszkocs [12] proposed
the interactive use of an associative dictionary that was constructed based
on a comparative analysis of term distributions. Also well known is
Robertson's analysis of term selection for query expansion [27]. More
recently, Carpineto et al. [8] proposed an effective QE method based on
information theoretic principles. This method uses the Kullback-Leibler
divergence (KLD) between the probability distributions of terms in the
relevant (or pseudo-relevant) documents and in the complete corpus.

Amati [1] proposes a new distribution-based method which uses Bose-Einstein
statistics. This method also calculates the divergence between the
distribution of terms in the pseudo relevant document set and a random
distribution.

Efforts have also been made to combine AQE methods in various ways to
improve retrieval effectiveness. Carpineto et al. [10] combined the scoring
functions of a number of methods, all of them distribution-based, to obtain
improvements. In contrast, we combine a distribution-based method with an
association-based method (based on our belief that these two classes of
methods offer different advantages). Also, rather than combining scores, we
use one method to refine the set of terms selected by the other.¹ Our
approach is somewhat similar in spirit to a method proposed by Cao et al.
[7], in which terms selected using standard pseudo relevance feedback (PRF)
are refined using a classifier that is trained to differentiate between
useful and harmful candidate expansion terms. Our work is most strongly
related to that of Pérez-Agüera and Araujo [23], who also combine
co-occurrence-based and distribution-based methods. Their combination is
relatively straightforward: one method is used for term selection and the
other for weighting. Word co-occurrence is measured using the Tanimoto
coefficient. Distributional differences are measured based on KLD or
Bose-Einstein statistics. The methods are tested on a relatively small
Spanish dataset. In contrast, we use the well-known LCA and RM3 methods
(instead of the Tanimoto coefficient) to quantify term association. Also,
instead of simply using one method for term selection and the other for
weighting, we combine both methods for selection. Finally, we test our method
on a number of large TREC datasets.

1 Of course, this can also, strictly speaking, be regarded as a combination where one com- 
ponent is very highly weighted. 
3 Methods



3.1 Basic Methods 

We first review KLD, Bol, LCA and RM3, the existing methods that form the 
base of our approach. 

3.1.1 Distribution based method I: KLD 

The approach proposed by Carpineto et al. [8] is one of the two distribution
based term ranking methods used in this study. In this method, all terms
in the pseudo relevant set are treated as candidate expansion terms. Let R
and C represent the (pseudo) relevant documents (PRD) and the whole corpus
respectively. We use $p_r$ and $p_c$ to denote the unigram probability
distributions of terms in R and C respectively; $p_r$ and $p_c$ are
calculated as shown in Equations 1 and 2 (tf(t, d) represents the term
frequency of term t in document d).

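$$p_r(t) = \frac{\sum_{d \in R} tf(t, d)}{\sum_{d \in R} \sum_{t' \in d} tf(t', d)} \qquad (1)$$

$$p_c(t) = \frac{\sum_{d \in C} tf(t, d)}{\sum_{d \in C} \sum_{t' \in d} tf(t', d)} \qquad (2)$$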

The contribution of a term to the divergence between $p_r$ and $p_c$ is given
by Equation 3. Terms for which this contribution is the largest are selected
as expansion terms.

$$S(t) = p_r(t) \cdot \log \frac{p_r(t)}{p_c(t)} \qquad (3)$$

S(t) is also used as the term weight of a candidate expansion term t.
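The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are represented as lists of tokens, the natural logarithm is used (the base of the logarithm is an assumption here), and the function name `kld_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def kld_scores(prd_docs, corpus_docs):
    """Score each term of the pseudo-relevant documents by its KLD contribution.

    Both arguments are lists of tokenised documents (lists of terms).
    The pseudo-relevant documents are assumed to be part of the corpus.
    """
    tf_r = Counter(t for d in prd_docs for t in d)     # term counts in R
    tf_c = Counter(t for d in corpus_docs for t in d)  # term counts in C
    len_r = sum(tf_r.values())
    len_c = sum(tf_c.values())
    scores = {}
    for t, f in tf_r.items():
        p_r = f / len_r
        p_c = tf_c[t] / len_c  # non-zero because R is a subset of C
        scores[t] = p_r * math.log(p_r / p_c)
    return scores

# Toy corpus (illustrative data): the first two documents play the role of R.
corpus = [["tax", "cut", "bill"], ["tax", "law"],
          ["rain", "storm"], ["storm", "wind", "rain"]]
scores = kld_scores(corpus[:2], corpus)
top = sorted(scores, key=scores.get, reverse=True)
```

Terms that are frequent in R but comparatively rare in C end up at the head of `top`; terms that never occur in R are not candidates at all.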

In our experiments with KLD (and other methods), we use Equations 4 to 6 to
merge the original query terms with the candidate expansion terms to
formulate the final expanded query. The weights of original query terms are
normalized using the maximum original query term weight (Eqn. 4); weights of
expansion terms are similarly normalized (Eqn. 5). These weights are simply
added together to obtain the final weight of a term t in the expanded query
(Eqn. 6).

$$score_{orig}(t) = \frac{1 + \log(tf(t, Q))}{1 + \max_{t' \in Q} \log(tf(t', Q))} \qquad (4)$$

$$score_{exp}(t) = \frac{S(t)}{\max_{t' \in d,\, d \in PRD} S(t')} \qquad (5)$$

$$score(t) = score_{orig}(t) + score_{exp}(t) \qquad (6)$$
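The merging step can be illustrated as follows. The sketch assumes that an original query term is weighted by 1 + log tf(t, Q) before normalization, which is one reading of Equation 4, and that the highest-scoring candidate term is among the selected expansion terms, so that the maximum in Equation 5 can be taken over the selected set. The function name `merge_weights` and the example values are hypothetical.

```python
import math

def merge_weights(query_tf, expansion_scores):
    """Combine original-query and expansion-term weights into one query.

    query_tf:         term -> frequency of the term in the original query Q.
    expansion_scores: term -> S(t) for the selected expansion terms.
    """
    max_log = max(math.log(f) for f in query_tf.values())
    max_s = max(expansion_scores.values())
    final = {}
    for t, f in query_tf.items():
        # Normalised weight of an original query term (assumed form of Eqn. 4).
        final[t] = (1 + math.log(f)) / (1 + max_log)
    for t, s in expansion_scores.items():
        # Normalised expansion weight (Eqn. 5), added to any original weight (Eqn. 6).
        final[t] = final.get(t, 0.0) + s / max_s
    return final

weights = merge_weights({"tax": 1, "reform": 1}, {"tax": 0.2, "bill": 0.4})
```

A term that is both an original query term and an expansion term (here "tax") receives the sum of the two normalized weights.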
3.1.2 Distribution based method II: Bol

The second, more recent, distribution based term ranking model we considered
is Bol, which is the most effective variant of the Divergence From Randomness
(DFR) term weighting model [35] HI]. In this model, which is based on
Bose-Einstein statistics, the informativeness of a term t is measured by the
divergence between its distribution in the top ranked documents and a random
distribution. Specifically, the score of a candidate expansion term t is
given by

$$S(t) = \left( \sum_{d \in PRD} tf(t, d) \right) \cdot \log_2 \left( \frac{1 + f_{avg}(t, C)}{f_{avg}(t, C)} \right) + \log_2 (1 + f_{avg}(t, C)) \qquad (7)$$

where

$$f_{avg}(t, C) = \frac{\sum_{d \in C} tf(t, d)}{N} \qquad (8)$$

denotes the average term frequency of t in the collection (N is the number of
documents in the collection). As in Section 3.1.1, we use Equations 4 to 6 to
merge the original query with the expansion terms and formulate the new
expanded query.
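As a rough illustration of this score, the following sketch computes it for a single term from three aggregate counts. The function name `bo1_score` and the example counts are hypothetical; the term is assumed to occur at least once in the collection, so that the average frequency is non-zero.

```python
import math

def bo1_score(tf_in_prd, coll_freq, num_docs):
    """Bose-Einstein score of one candidate expansion term.

    tf_in_prd: total frequency of the term in the pseudo-relevant documents.
    coll_freq: total frequency of the term in the whole collection (> 0).
    num_docs:  number of documents N in the collection.
    """
    f_avg = coll_freq / num_docs  # average term frequency in the collection
    return tf_in_prd * math.log2((1 + f_avg) / f_avg) + math.log2(1 + f_avg)

# Two terms that are equally frequent in the top-ranked documents:
rare = bo1_score(tf_in_prd=12, coll_freq=50, num_docs=100_000)
common = bo1_score(tf_in_prd=12, coll_freq=80_000, num_docs=100_000)
```

With equal frequency in the top-ranked set, the term that is rarer in the collection receives the higher score, which is the behaviour the divergence-from-randomness argument calls for.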



3.1.3 Modified Bol 

Taking the Bol formula as a starting point, we modify it as follows to obtain
a more effective scoring function for an expansion term t. First, an
occurrence of t in a top-ranked document is considered more important than an
occurrence in a lower ranked document. Thus, instead of using tf(t, d)
directly, we scale the term frequencies by the normalized similarity score of
the corresponding document. We then incorporate inverse collection frequency
information as shown in Equation 10.

While the tf factor in Equation 10 is indicative of the distribution of t in
the top ranked set, the ictf factor reflects the distribution of the term in
the collection.

ictf(t) = log 10 (9) 
^— ' \ max Simla', U) I l + ictf(t) 

dePRD \ d'dPRD K '^V 1 W 

Finally, Equations 4 to 6 are used to merge the original query with the
expansion terms.



3.1.4 Association based method I: LCA 

LCA [31] is one of the most well-known association based term selection
methods. This method also considers all terms from the top ranked set as
candidate expansion terms. Equations 11 through 14 show how the co-occurrence
is calculated for a candidate term t and a query Q consisting of terms
$q_1, \ldots, q_k$ ($N_t$ is the number of documents containing t, PRD
denotes the set of pseudo-relevant documents, $n = |PRD|$, and $\delta$ is
set to 0.1 as suggested in [31]).

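$$idf_t = \min(\log_{10}(N / N_t) / 5.0,\ 1.0) \qquad (11)$$

$$co(t, q_i) = \sum_{d \in PRD} tf(t, d) \cdot tf(q_i, d) \qquad (12)$$

$$codegree(t, q_i) = \frac{\log_{10}(co(t, q_i) + 1) \cdot idf_t}{\log_{10}(n)} \qquad (13)$$

$$S(t) = \sum_{i=1}^{k} idf_{q_i} \cdot \log_{10}(\delta + codegree(t, q_i)) \qquad (14)$$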

The T terms with the highest S(t) scores are selected as expansion terms.
Finally, the j-th "best" term is weighted according to Equation 15.

$$score_{exp}(t) = 1.0 - \frac{0.9 \cdot j}{T} \qquad (15)$$

We did not use noun-phrases or passage level retrieval, since the authors
show that these refinements do not have much impact. Our experiments confirm
that our implementation yields very similar results for the collections and
settings mentioned in [31].
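A compact sketch of the LCA scoring described in this section is given below. It assumes that the co-degree is normalized by log10(n), as in Xu and Croft's formulation, and that the rank-based weight divides by the number of selected terms T. Documents are represented as term-frequency dictionaries; the function names and the toy data are hypothetical.

```python
import math

def lca_scores(prd, query, doc_freq, num_docs, delta=0.1):
    """LCA association score S(t) for every term in the pseudo-relevant set.

    prd:      list of dicts (term -> tf), one per pseudo-relevant document.
    doc_freq: term -> number of documents containing the term.
    Assumes len(prd) > 1 so that log10(n) is non-zero.
    """
    n = len(prd)

    def idf(t):
        return min(math.log10(num_docs / doc_freq[t]) / 5.0, 1.0)

    scores = {}
    for t in {t for d in prd for t in d}:
        s = 0.0
        for q in query:
            # Raw co-occurrence: product of term frequencies, summed over PRD.
            co = sum(d.get(t, 0) * d.get(q, 0) for d in prd)
            codegree = math.log10(co + 1) * idf(t) / math.log10(n)
            s += idf(q) * math.log10(delta + codegree)
        scores[t] = s
    return scores

def lca_term_weight(rank, num_terms):
    """Weight of the term at 1-based position `rank` among T selected terms."""
    return 1.0 - 0.9 * rank / num_terms

# Toy data: one document in which a rare term and the query term are both frequent.
prd = [{"construction": 35, "papuc": 21},
       {"construction": 2, "minority": 2, "contract": 1},
       {"construction": 1, "minority": 1}]
doc_freq = {"construction": 1000, "papuc": 10, "minority": 500, "contract": 800}
scores = lca_scores(prd, ["construction"], doc_freq, num_docs=100_000)
```

In this toy setting a single document with large raw frequencies is enough to push "papuc" above "minority", even though "minority" co-occurs with the query term in two documents.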

3.1.5 Modified LCA 

Our implementation of the above formulae did not yield the expected
improvements. A failure analysis suggests that Equation 12 might be the
culprit. For example, consider the TREC4 query: "How has affirmative action
affected the construction industry?". Two terms, papuc (Pennsylvania Public
Utility Commission) and limerick, are very highly ranked among candidate
terms by Equation 13 even though these are not useful expansion terms. This
is because in one top document, the word 'papuc' occurs 21 times and a query
word ('construction') occurs 35 times. The multiplication of raw term
frequencies in Equation 12 results in a very high weight for the term
'papuc'. A similar problem occurs in the case of 'limerick', which occurs 17
times in one document.

Our hypothesis is that the number of co-occurrences of a term pair can only
be as large as the minimum term frequency of the two terms under
consideration. We also hypothesize that co-occurrences in a document are more
important if the document is "close" (or similar) to the query. Finally, we
use the idf factor² for a candidate expansion term when calculating its
co-occurrence (Equation 17); it is no longer used when calculating co-degree
(Equation 18). Equations 16 to 19 define our modified approach for
calculating the association between a candidate term t and the query Q.

$$idf_t = \log_{10} \frac{N - N_t + 0.5}{N_t + 0.5} \qquad (16)$$

$$co(t, q_i) = \sum_{d \in PRD} \left( \min(tf(t, d), tf(q_i, d)) \cdot \max(idf_{t \vee q_i}, 0) \cdot \frac{Sim(d, Q)}{\max_{d' \in PRD} Sim(d', Q)} \right) \qquad (17)$$

where $idf_{t \vee q_i}$ denotes the idf of term t or $q_i$, based on whose
term frequency is minimum in document d.

$$codegree(t, q_i) = \frac{\log_{10}(co(t, q_i) + 1)}{\log_{10}(n)} \qquad (18)$$

$$S(t) = \sum_{i=1}^{k} idf_{q_i} \cdot \log_{10}(\delta + codegree(t, q_i)) \qquad (19)$$

As before, the T terms with the highest association scores (S(t)) are
selected as expansion terms. The final term weights in the expanded query are
determined using Equations 4 to 6.

2 Note that we use Robertson's idf formula [28] (Equation 16) instead of
Equation 11.
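The modified association score can be sketched along the same lines. The sketch below is a hedged illustration rather than a reference implementation: it assumes that the co-degree is normalized by log10(n), that ties in term frequency are resolved in favour of the candidate term when choosing which idf to use, and that a retrieval score Sim(d, Q) is available for each pseudo-relevant document. All function names and example values are hypothetical.

```python
import math

def robertson_idf(df, num_docs):
    """Robertson's idf; it becomes negative for terms in over half the documents."""
    return math.log10((num_docs - df + 0.5) / (df + 0.5))

def lca_new_scores(prd, sims, query, doc_freq, num_docs, delta=0.1):
    """Modified LCA score S(t) for every term in the pseudo-relevant set.

    prd:  list of dicts (term -> tf), one per pseudo-relevant document.
    sims: Sim(d, Q) for each document in prd, in the same order.
    """
    n, max_sim = len(prd), max(sims)
    idf = {t: robertson_idf(df, num_docs) for t, df in doc_freq.items()}
    scores = {}
    for t in {t for d in prd for t in d}:
        s = 0.0
        for q in query:
            co = 0.0
            for d, sim in zip(prd, sims):
                tf_t, tf_q = d.get(t, 0), d.get(q, 0)
                # Use the idf of whichever term has the smaller tf in d
                # (ties go to the candidate term: an assumption).
                rarer = t if tf_t <= tf_q else q
                co += min(tf_t, tf_q) * max(idf[rarer], 0.0) * sim / max_sim
            codegree = math.log10(co + 1) / math.log10(n)
            s += idf[q] * math.log10(delta + codegree)
        scores[t] = s
    return scores

# Toy data: "a" and "b" are equally rare, but "a" occurs in the better-ranked document.
prd = [{"q": 3, "a": 2}, {"q": 3, "b": 2}, {"q": 1, "c": 1}]
doc_freq = {"q": 100, "a": 50, "b": 50, "c": 900}
scores = lca_new_scores(prd, [2.0, 1.0, 1.0], ["q"], doc_freq, num_docs=1000)
```

Taking the minimum of the two term frequencies caps the contribution of any single document, and the similarity ratio favours co-occurrences in documents ranked close to the query. A very common term such as "c" has a negative idf, which is clipped to zero, so it accumulates no co-occurrence at all.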

3.1.6 Association based method II: RM3 

Relevance-based language models [17] [15] constitute a more recent
association-based approach. In this approach, the association S(t) between a
word t and a query $Q = q_1, \ldots, q_k$ can be measured by
$P(t, q_1, \ldots, q_k)$, the joint probability of observing the word
together with the query words, when these words are all sampled from an
(unknown) relevance model. This relevance model consists of a finite universe
$\mathcal{M}$ of unigram distributions, each of which corresponds to a
(pseudo) relevant document. Under the assumption that $t, q_1, \ldots, q_k$
are independently and identically sampled from $M \in \mathcal{M}$,

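$$S(t) = P(t, q_1, \ldots, q_k) = \sum_{M \in \mathcal{M}} P(M)\, P(t|M) \prod_{i=1}^{k} P(q_i|M) \approx \frac{1}{|PRD|} \sum_{d \in PRD} \frac{tf(t, d)}{|d|} \prod_{i=1}^{k} \frac{tf(q_i, d) + \mu P(q_i|C)}{|d| + \mu} \qquad (20)$$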

where $\mu = 2500$ is a smoothing parameter, and $P(q_i|C) = p_c(q_i)$.
Equations 21 to 23 show, as before, how the expanded query terms are added to
the original query. This implementation duplicates the LEMUR RM3 method.
However, we used the i.i.d. sampling approach instead of the conditional
sampling method recommended in [17], since this gave us better results.

$$score_{exp}(t) = \frac{S(t)}{\sum_{t' \in d,\, d \in PRD} S(t')} \qquad (21)$$

$$score_{orig}(t) = \frac{tf(t, Q)}{|Q|} \qquad (22)$$
$$score(t) = \alpha \cdot score_{exp}(t) + (1 - \alpha) \cdot score_{orig}(t), \quad \text{where } 0 < \alpha < 1 \qquad (23)$$
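The relevance-model computation can be sketched as below. Several details are assumptions made for illustration: a uniform prior over the pseudo-relevant documents, a maximum-likelihood estimate for P(t|M), Dirichlet smoothing for the query-term probabilities, expansion scores normalized by their sum, and an original-query model equal to the relative frequency of a term in the query. The function names and toy data are hypothetical and are not taken from LEMUR.

```python
def rm3_association(prd, doc_lens, query, p_coll, mu=2500.0):
    """Association S(t) under i.i.d. sampling with a uniform document prior.

    prd:      list of dicts (term -> tf), one per pseudo-relevant document.
    doc_lens: length |d| of each document, in the same order.
    p_coll:   query term -> collection probability P(q|C).
    """
    scores = {}
    for d, dl in zip(prd, doc_lens):
        q_lik = 1.0
        for q in query:
            # Dirichlet-smoothed probability of the query term given the document.
            q_lik *= (d.get(q, 0) + mu * p_coll[q]) / (dl + mu)
        for t, tf in d.items():
            scores[t] = scores.get(t, 0.0) + (tf / dl) * q_lik / len(prd)
    return scores

def rm3_weights(assoc, query_tf, alpha=0.5):
    """Interpolate normalised expansion scores with the original query model."""
    total = sum(assoc.values())
    q_len = sum(query_tf.values())
    return {t: alpha * assoc.get(t, 0.0) / total
               + (1 - alpha) * query_tf.get(t, 0) / q_len
            for t in set(assoc) | set(query_tf)}

prd = [{"tax": 2, "cut": 1, "bill": 1}, {"rain": 3, "storm": 1}]
assoc = rm3_association(prd, [4, 4], ["tax"], {"tax": 0.001})
weights = rm3_weights(assoc, {"tax": 1})
```

Because both components are probability distributions over terms, the interpolated weights again sum to one, and terms from documents that contain the query term receive more mass than equally frequent terms from documents that do not.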

3.2 Combining association based method with distribution based method

Section 3.1 reviews two different types of query expansion methods. In this
section, we describe a hybrid approach that combines the above methods to
improve retrieval effectiveness.

We conducted some preliminary experiments to explore various ways to com- 
bine individual methods. Our first attempt involved simply adding up the nor- 
malized weights of the expansion terms as computed by the individual methods. 
This particular method did not perform better than the individual methods. 
Next, we tried to apply the methods sequentially: the original query is ex- 
panded using one of the methods, and the expanded query is then used as the 
initial query for the other method and expanded further. This approach also 
results in a performance drop. The final approach that we tried also applies the 
methods sequentially, but in a different way. One of the methods is used first 
to create a large expanded query. This query is then refined (instead of being 
expanded further) using the other method. This method turns out to work well, 
and yields significant improvements over the individual methods. 

We can see from Table 5 that the distribution-based methods generally
perform better than the association-based methods on most of the test
collections used in our experiments. We therefore choose a distribution-based
method, KLD (Equation 3) or Bol (Equation 10), to first select (and weight) a
relatively large number of candidate terms that occur preferentially in a few
top-ranked documents, where the proportion of relevant documents is expected
to be high. This set is then refined using co-occurrence information: terms
that do not co-occur significantly with original query terms are discarded.
Conversely, candidate expansion terms that are relatively poorly ranked by
the distribution-based method have a chance to be included in the final query
if they adequately co-occur with the original query terms. More precisely,
the candidate terms are re-ranked using an association-based method, our
modified version of LCA (Equation 19) or RM3 (Equation 20), that looks at a
larger number of top-ranked documents. The top T terms from this re-ranked
list are chosen as the final expansion terms. However, we retain the weights
of these terms as determined by the distribution-based method. As before, the
final term weights in the expanded query are determined using Equations 4 to
6.
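The select-then-refine procedure can be summarized in a short sketch. It takes two precomputed score dictionaries as input; the function name `combine`, the handling of terms without an association score, and the example values are assumptions made for illustration. The default sizes (100 candidates, 40 final terms) follow the settings reported for the combination experiments.

```python
def combine(dist_scores, assoc_scores, num_candidates=100, num_final=40):
    """Select by distribution score, re-rank by association, keep distribution weights.

    dist_scores:  term -> score from KLD or Bol (computed on a few top documents).
    assoc_scores: term -> score from LCA or RM3 (computed on more top documents).
    """
    # Step 1: a relatively large candidate set from the distribution-based method.
    candidates = sorted(dist_scores, key=dist_scores.get, reverse=True)[:num_candidates]
    # Step 2: re-rank the candidates by association with the query terms.
    # Candidates without an association score are ranked last (an assumption).
    reranked = sorted(candidates,
                      key=lambda t: assoc_scores.get(t, float("-inf")),
                      reverse=True)
    # Step 3: keep the best terms, weighted by their distribution-based score.
    return {t: dist_scores[t] for t in reranked[:num_final]}

dist = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
assoc = {"a": 0.1, "b": 0.5, "c": 0.9, "d": 1.0}
selected = combine(dist, assoc, num_candidates=3, num_final=2)
```

In the example, "d" has the strongest association but never enters the candidate set, while "a" has the best distribution score but is filtered out for weak association; the surviving terms keep their distribution-based weights.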

4 Experimental Setup 

Table 2 lists the details of the test collections used in our experiments. As
real-life queries are very short, we used only the title field of all these
queries, except for the TREC4 queries, which contain only the description
field. Many of the queries thus contain only one term, and most of the
remainder are no longer than three words; only the TREC4 queries are longer.

Collection   # of Queries   Query Id.   Documents
TREC123      150            51-200      TREC disks 1, 2
TREC4        49             202-250     TREC disks 2, 3
TREC5        50             251-300     TREC disks 2, 4
TREC678      150            301-450     TREC disks 4, 5 - CR
ROBnew       100            601-700     TREC disks 4, 5 - CR
TREC910      100            451-550     WT10G

Table 2: Test collections

We used the TERRIER³ retrieval system for our experiments. At indexing time,
stopwords are removed and Porter's stemmer is applied. All documents and
queries are indexed using single terms; no phrases are used. The IFB2 variant
of the Divergence From Randomness model [2], a relatively recent model that
performs well across test collections, is used for term-weighting in all our
experiments, as it performs better than the other variants available within
TERRIER. Parameters are set to the default values used in TERRIER.

Results are evaluated using standard evaluation metrics (Mean Average Pre- 
cision (MAP), precision at top 10 ranks (P@10), and overall recall (number of 
relevant documents retrieved)). Additionally, for each expansion method, we re- 
port the percentage of queries for which the method resulted in an improvement 
in MAP of more than 5% over the baseline (no feedback). 
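For concreteness, the per-query measures can be computed as in the sketch below. It assumes that "more than 5%" refers to a relative improvement in average precision for a query; the function names and the sample rankings are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list, given the set of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=10):
    """Fraction of the top k documents that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def pct_queries_improved(ap_method, ap_baseline, threshold=0.05):
    """Percentage of queries whose AP beats the baseline AP by more than `threshold`."""
    better = sum(1 for q, ap in ap_method.items()
                 if ap > ap_baseline[q] * (1 + threshold))
    return 100.0 * better / len(ap_method)

ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
p2 = precision_at(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=2)
improved = pct_queries_improved({"q1": 0.30, "q2": 0.205}, {"q1": 0.20, "q2": 0.20})
```

MAP is then the mean of `average_precision` over all queries of a collection, and the last function yields the robustness figure reported for each expansion method.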

5 Experimental Results 

We now present experimental results for the QE methods described in Section
3. The first set of results, presented in Section 5.1, pertains to our
implementation of well-known QE methods, as well as the proposed refinements
to these methods. Section 5.2 corresponds to the combination-based method
described in Section 3.2.

Notation. We use the labels listed in Table 3 to denote various techniques in
tables and figures. In the following tables, results that are statistically
significantly better (as determined by a two-tailed paired t-test with a
confidence level of 95%) than the baseline (no feedback), KLD, Bolnew, LCAnew
and RM3 are marked with the superscripts B, k, b, l and r respectively.

3 http://terrier.org/






Name      Description                               Details in
KLD       Our implementation of the KLD method      Section 3.1.1
Bol       Our implementation of the Bol method      Section 3.1.2
Bolnew    Modified Bol method                       Section 3.1.3
LCA       Our implementation of the LCA method      Section 3.1.4
LCAnew    Modified LCA method                       Section 3.1.5
RM3       Our implementation of the RM3 method      Section 3.1.6

KLDLCA    Combination of KLD and LCAnew
KLDRM3    Combination of KLD and RM3
BolLCA    Combination of Bolnew and LCAnew
BolRM3    Combination of Bolnew and RM3

Table 3: Labels for various QE methods



5.1 Experiment 1: modified methods 

Baselines. For comparison, we use the following baselines.

1. No feedback. The original, unexpanded queries are used for retrieval using
the baseline method described in Section 4.

2. Bol. For this method, Amati [1] suggested adding T = 10 expansion
terms from the top D = 3 documents. We use T = 40 and D = 10
instead, since we wanted a larger number of candidate terms, particularly
for the combination-based method. Our experiments confirm that we get
comparable results with these parameter settings.

3. LCA. To determine the parameters for LCA, we used the TREC678
collection as a "tuning" dataset, as TREC678 is comparatively recent, and
contains a large set of queries. We varied the number of top-ranked
documents (D) from 10 to 50 in steps of 10, while the number of expansion
terms (T) was varied from 5 to 50 in steps of 5. Xu and Croft [31]
recommended using D = 70 and T = 70. In our setup, however, a setting
of D = 10 documents and T = 40 expansion terms works well in terms of
MAP (Figure 1). We use these values on all collections used in our
experiments. A similar exercise suggests that the same settings can be
used for LCAnew as well.
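The tuning exercise described for LCA amounts to a small grid search over D and T. A minimal sketch is shown below; `evaluate` is a hypothetical callable that would run retrieval with the given settings on the tuning collection and return MAP, and the stand-in used in the example is invented purely to make the sketch executable.

```python
def tune(evaluate, doc_counts=range(10, 51, 10), term_counts=range(5, 51, 5)):
    """Return the (D, T) pair that maximises the value returned by `evaluate`.

    evaluate: callable taking (D, T) and returning MAP on the tuning collection.
    """
    grid = ((d, t) for d in doc_counts for t in term_counts)
    return max(grid, key=lambda dt: evaluate(*dt))

# Stand-in for a real retrieval run (illustrative only): peaks at D = 10, T = 40.
def fake_map(d, t):
    return 0.25 - 0.0001 * (d - 10) ** 2 - 0.00001 * (t - 40) ** 2

best_d, best_t = tune(fake_map)
```

The default ranges mirror the sweep described above: D from 10 to 50 in steps of 10 and T from 5 to 50 in steps of 5.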



LCAnew. Our first goal is to verify that the proposed modifications to the
LCA formula actually yield improvements in retrieval performance. Table 4
shows that, with 'title only' queries, our implementation of the original LCA
formula results in a drop in MAP for almost all collections. Only for the
TREC4 collection (in which queries consist of a description only) is a
marginal improvement observed, suggesting that the original method works
better for longer queries. Indeed, the experiments by Xu and Croft all used
relatively long queries, e.g., the full TREC3 queries (including title,
description and narrative fields), the TREC4 queries, and the description
field of TREC5 queries.

Compared to the original formula, the modified formula results in significant
improvements in MAP across all data sets. On the ROBnew collection in
particular, it performs very well, outperforming the original method by
nearly 24%. For the TREC4 corpus, an improvement of about 11.41% (over LCA)
is observed. The modifications thus seem to be effective for both short and
relatively long queries. The LCAnew method is also better in terms of P@10,
number of relevant documents retrieved, and robustness.

Figure 1: Performance (MAP) of LCA on TREC678 for different parameter settings.

Bolnew. Table 4 shows that Bolnew gives better results than Bol across all
test collections. For the TREC123, ROBnew, and TREC910 collections, these
improvements are significant. The modified method also yields better P@10,
and appears to be more robust across all datasets. With regard to the number
of relevant documents retrieved, Bolnew is better on most collections.
Overall, Bolnew appears to be a superior alternative to Bol in all respects.

Thus, based on Table 4, we conclude that LCAnew and Bolnew are more
effective, and can be used in place of LCA and Bol.

5.2 Experiment 2: combination methods 

As explained in Section 3.2, in the combination-based approach, we first
select a large set of candidate terms (T = 100) from D = 10 documents using a
distribution-based QE method. The association of these candidate terms with
the query terms is computed using the top D' = 50 documents⁴, and the best
T' = 40 terms (as determined by an association-based method) are included in
the final query. We report results for a total of 2 x 2 = 4 combinations:
KLDLCA (LCAnew with KLD), KLDRM3 (RM3 with KLD), BolLCA (LCAnew with
Bolnew), and BolRM3 (RM3 with Bolnew).

4 Measuring association scores over the top 30-50 documents works about equally well. 
Collection  Measure          Baseline  LCA             LCAnew            Bol             Bolnew
TREC123     MAP              0.218     0.213 (-2.4)    0.254^B* (16.4)   0.272 (24.4)    0.277^B* (26.6)
            P@10             0.481     0.472 (-1.8)    0.520 (8.2)       0.531 (10.4)    0.545 (13.4)
            #rel_ret         16536     15714 (-5.0)    17475 (5.7)       18227 (10.2)    18242 (10.3)
            > baseline on              35              54                58              62
TREC4       MAP              0.217     0.219 (1.1)     0.244^B (12.6)    0.256 (17.8)    0.259^B (19.5)
            P@10             0.461     0.400 (-13.3)   0.496 (7.5)       0.441 (-4.4)    0.467 (1.3)
            #rel_ret         3482      3507 (0.7)      3691 (6.0)        3854 (10.7)     3768 (8.2)
            > baseline on              ?               ?                 ?               ?
TREC5       MAP              0.157     0.130 (-17.6)   0.152* (-3.1)     0.166 (5.4)     0.168 (7.0)
            P@10             0.286     0.210 (-26.6)   0.238 (-16.8)     0.248 (-13.3)   0.270 (-5.6)
            #rel_ret         1936      1894 (-2.2)     2053 (6.0)        2194 (13.3)     2183 (12.8)
            > baseline on              20              38                42              44
TREC678     MAP              0.218     0.209 (-4.2)    0.250^B* (14.8)   0.255 (16.8)    0.257^B (17.7)
            P@10             0.431     0.379 (-12.2)   0.420 (-2.6)      0.427 (-0.9)    0.436 (1.1)
            #rel_ret         7287      7367 (1.1)      8152 (11.9)       8529 (17.0)     8463 (16.1)
            > baseline on              36              52                53              60
ROBnew      MAP              0.278     0.264 (-5.0)    0.327^B* (17.6)   0.307 (10.3)    0.331^B* (19.0)
            P@10             0.421     0.385 (-8.6)    0.452 (7.2)       0.394 (-6.5)    0.433 (2.9)
            #rel_ret         2887      2864 (-0.8)     3009 (4.2)        3178 (10.1)     3202 (10.9)
            > baseline on              36              53                48              56
TREC910     MAP              0.195     0.155 (-20.6)   0.175 (-10.6)     0.189 (-3.3)    0.202* (3.5)
            P@10             0.307     0.231 (-24.9)   0.291 (-5.3)      0.284 (-7.6)    0.304 (-1.0)
            #rel_ret         3770      3440 (-8.8)     3646 (-3.3)       3974 (5.4)      3948 (4.7)
            > baseline on              27              33                41              45

Table 4: Improvements on different datasets obtained by modifying LCA / Bol.
Figures in parentheses are percentage changes relative to the baseline.
The "> baseline on" line shows the percentage of queries on which each method
beats the baseline by > 5%. A * denotes an improvement (over the original
formula) that is statistically significant.
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Baselines. We compare the combination-based methods with the following 
baselines. 

1. No feedback. Same as in Section 1531 

2. KLD. We find that a setting of D = 10 top-ranked documents and T =
40 expansion terms works well for KLD across collections. This is in
agreement with the observations of Carpineto et al. [5].

Note that the results presented here correspond to our implementation of
KLD within TERRIER. While our implementation provides better results
than TERRIER's native implementation of KLD, we were not able to
exactly replicate the results reported in [5]. This is likely due to differences
in the retrieval functions, indexing or query processing. For example,
using full queries (title, desc and narr) on the TREC8 collection, and
BM25 as the base term-weighting formula, we get a MAP score of 0.2992
for KLD (compared to a baseline of 0.2625). When using the IFB2 model,
however, the baseline is higher (MAP = 0.2753), but KLD appears less
effective (MAP = 0.2850).

3. Bolnew, LCAnew. As discussed in Section 5.1, for these methods also
we use D = 10 documents and T = 40 terms.

4. RM3. We use D = 50 documents (as suggested in [17]) and T = 50
terms. We set the Dirichlet smoothing parameter (μ) to 2500 and the
interpolation parameter to 0.5, based on the default settings for these
parameters in Lemur (see footnote 5). As before, we used the TREC678
collection to verify that these parameter values work well for us. In fact,
for a number of datasets, our results for RM3 are superior to those reported
in other recent papers ([3], for example).
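
To make the roles of D and T concrete, the following is a minimal sketch of distribution-based term selection with KLD. It assumes the commonly stated scoring form p(t|R) * log(p(t|R) / p(t|C)), where R is the set of D top-ranked documents and C is the whole collection; the function name, argument names and toy data are illustrative and are not taken from the TERRIER implementation used in the paper.

```python
import math
from collections import Counter


def kld_expansion_terms(feedback_docs, collection_tf, collection_len, num_terms=40):
    """Rank candidate terms by p(t|R) * log(p(t|R) / p(t|C)).

    feedback_docs  : list of token lists, one per top-ranked document (D of them)
    collection_tf  : dict mapping term -> frequency in the whole collection
    collection_len : total number of tokens in the collection
    num_terms      : number of expansion terms to keep (T)
    """
    rel_tf = Counter()
    for doc in feedback_docs:
        rel_tf.update(doc)
    rel_len = sum(rel_tf.values())

    scores = {}
    for term, tf in rel_tf.items():
        p_rel = tf / rel_len
        p_col = collection_tf.get(term, 0) / collection_len
        if p_col > 0:  # skip terms with no collection statistics
            scores[term] = p_rel * math.log(p_rel / p_col)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:num_terms]


# Toy data (hypothetical): two feedback documents, a 100-token collection.
docs = [["a", "b", "a"], ["a", "c"]]
stats = {"a": 10, "b": 50, "c": 1, "d": 39}
top_terms = [term for term, _ in kld_expansion_terms(docs, stats, 100, num_terms=2)]
```

With the settings quoted above, feedback_docs would hold the D = 10 top-ranked documents and num_terms would be T = 40.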

Table 5 shows that the proposed combined approaches are statistically
significantly better than the no-feedback method across all test collections
except for TREC5 and TREC910. More importantly, the combined methods
consistently work better than the individual QE methods involved in the
combination, as well as most of the other standard QE methods. These
differences are, by and large, statistically significant, with only a few
exceptions. Overall, while RM3 seems to be the best in terms of P@10 in most
cases, the combination-based methods are generally the best on all other
measures. We now briefly discuss each combination in turn.

KLDLCA. KLDLCA is better than KLD or LCAnew alone on all measures, 
and across all datasets. For 5 out of the 6 collections, the combination yields 
significant improvements in MAP over KLD or LCAnew or both. It is interesting 
to note that for the sixth collection (TREC5), LCAnew results in a drop in 
performance compared to the no-expansion baseline. However, the combinations 
KLDLCA and BolLCA perform better than the baseline as well as KLD. 

5 http://www.lemurproject.org/

Dataset   Measure        Baseline    KLD         Bolnew      LCAnew      RM3         KLDLCA      KLDRM3      BolLCA      BolRM3
TREC123   MAP            0.218       0.274       0.277       0.254       0.249       0.280"""'   0.277""'    0.285"*"'   0.284"'r
                                     (25.4)      (26.6)      (16.4)      (14.1)      (28.0)      (26.8)      (30.6)      (29.8)
          P@10           0.481       0.537       0.545       0.520       0.511       0.537       0.541       0.553       0.540
                                     (11.8)      (13.4)      (8.2)       (6.2)       (11.8)      (12.5)      (15.0)      (12.3)
          #rel_ret       16536       18299       18242       17475       17702       18585       18438       18701       18639
                                     (10.7)      (10.3)      (5.7)       (7.1)       (12.4)      (11.5)      (13.1)      (12.7)
          > baseline on              62          62          54          64          67          65          67          68
TREC4     MAP            0.217       0.261       0.259       0.244       0.252       0.279"*""'  0.265"      0.273"'     0.265"
                                     (20.2)      (19.5)      (12.6)      (15.9)      (28.7)      (22.3)      (25.6)      (21.9)
          P@10           0.461       0.455       0.467       0.496       0.516       0.498       0.480       0.498       0.502
                                     (-1.3)      (1.3)       (7.5)       (11.9)      (8.0)       (4.0)       (8.0)       (8.8)
          #rel_ret       3482        3815        3768        3691        3689        3882        3781        3846        3775
                                     (9.6)       (8.2)       (6.0)       (5.9)       (11.5)      (8.6)       (10.5)      (8.4)
          > baseline on              57          57          57          75          55          59          57          61
TREC5     MAP            0.157       0.168       0.168       0.152       0.170       0.171       0.172*      0.174'      0.173
                                     (6.9)       (7.0)       (-3.1)      (8.2)       (9.0)       (9.2)       (10.4)      (9.9)
          P@10           0.286       0.268       0.270       0.238       0.336       0.274       0.280       0.290       0.304
                                     (-6.3)      (-5.6)      (-16.8)     (17.5)      (-4.2)      (-2.1)      (1.4)       (6.3)
          #rel_ret       1936        2184        2183        2053        2077        2218        2166        2226        2184
                                     (12.8)      (12.8)      (6.0)       (7.3)       (14.6)      (11.9)      (15.0)      (12.8)
          > baseline on              42          44          38          50          52          50          48          48
TREC678   MAP            0.218       0.257       0.257       0.250       0.230       0.266"*"'r  0.260"''    0.265"6"'   0.259"r
                                     (18.0)      (17.7)      (14.8)      (5.6)       (22.0)      (19.2)      (21.6)      (18.7)
          P@10           0.431       0.438       0.436       0.420       0.435       0.441       0.431       0.435       0.428
                                     (1.6)       (1.1)       (-2.6)      (0.8)       (2.2)       (0.0)       (0.8)       (-0.8)
          #rel_ret       7287        8556        8463        8152        7617        8567        8552        8570        8449
                                     (17.4)      (16.1)      (11.9)      (4.5)       (17.6)      (17.4)      (17.6)      (15.9)
          > baseline on              52          60          52          45          57          57          61          58
ROBnew    MAP            0.278       0.312       0.331       0.327       0.305       0.326"'"'   0.322""     0.341"*'"'  0.341"*"*'
                                     (12.2)      (19.0)      (17.6)      (9.8)       (17.2)      (15.9)      (22.5)      (22.6)
          P@10           0.421       0.405       0.433       0.452       0.442       0.438       0.424       0.455       0.455
                                     (-3.8)      (2.9)       (7.2)       (5.0)       (4.1)       (0.7)       (7.9)       (7.9)
          #rel_ret       2887        3172        3202        3009        3002        3173        3160        3214        3218
                                     (9.9)       (10.9)      (4.2)       (4.0)       (9.9)       (9.5)       (11.3)      (11.5)
          > baseline on              52          56          53          56          55          57          62          63
TREC910   MAP            0.195       0.193       0.202       0.175       0.211       0.204*'     0.210*'     0.207"'     0.213*'
                                     (-1.1)      (3.5)       (-10.6)     (8.0)       (4.7)       (7.4)       (6.0)       (9.1)
          P@10           0.307       0.293       0.304       0.291       0.329       0.313       0.309       0.320       0.313
                                     (-4.6)      (-1.0)      (-5.3)      (7.0)       (2.0)       (0.7)       (4.3)       (2.0)
          #rel_ret       3770        3987        3948        3646        3889        4021        3992        4016        4018
                                     (5.8)       (4.7)       (-3.3)      (3.2)       (6.7)       (5.9)       (6.5)       (6.6)
          > baseline on              44          45          33          53          51          50          53          48

Table 5: Improvements on different datasets obtained by combining association based and distribution based QE methods. 
(The "> baseline on" line shows the %-age of queries on which each method beats the baseline by > 5%.) 
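
The "> baseline on" row can be computed from per-query scores. The sketch below assumes that the underlying score is per-query average precision (AP) and that "beats the baseline by > 5%" means a relative improvement of more than 5% on that query; both are our reading of the caption, and the function and argument names are illustrative.

```python
def share_of_queries_improved(baseline_ap, method_ap, margin=0.05):
    """Percentage of queries on which the method exceeds the baseline AP
    by more than `margin` (relative), rounded to the nearest integer.

    baseline_ap, method_ap : dicts mapping query id -> average precision
    """
    improved = 0
    for qid, base in baseline_ap.items():
        if method_ap[qid] > base * (1.0 + margin):
            improved += 1
    return round(100.0 * improved / len(baseline_ap))


# Toy per-query AP values (hypothetical, not taken from the paper).
baseline = {1: 0.20, 2: 0.50, 3: 0.10, 4: 0.40}
expanded = {1: 0.25, 2: 0.51, 3: 0.10, 4: 0.50}
share = share_of_queries_improved(baseline, expanded)
```

In this toy case, queries 1 and 4 clear the 5% margin while queries 2 and 3 do not, so the share is 2 out of 4.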



In general, the combination also seems to be safer, in the sense that
combination-based expansion usually hurts fewer queries than expansion using
either KLD or LCAnew. On a related note, a query-wise analysis of the TREC678 dataset
shows that out of the 150 queries in this collection, there are 59 queries on 
which KLD outperforms KLDLCA (with an average improvement in MAP of 
0.0148), but KLDLCA does better than KLD on 85 queries, and improves MAP 
by 0.0255 on average. Similarly, LCAnew performs better than KLDLCA on 68 
queries (average improvement in MAP = 0.0360), whereas KLDLCA wins on 
81 queries (average improvement in MAP = 0.0594). 
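
A query-wise comparison of this kind can be sketched as follows, assuming per-query AP values are available for both runs and that the "average improvement" is the mean AP difference over the queries won by each method. The function and key names are illustrative.

```python
def pairwise_comparison(ap_a, ap_b):
    """Count the queries won by each of two runs and the mean AP gain on them.

    ap_a, ap_b : dicts mapping query id -> average precision for runs A and B
    """
    gains_a, gains_b = [], []
    for qid, score_a in ap_a.items():
        diff = score_a - ap_b[qid]
        if diff > 0:
            gains_a.append(diff)
        elif diff < 0:
            gains_b.append(-diff)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "a_wins": len(gains_a),
        "a_mean_gain": mean(gains_a),
        "b_wins": len(gains_b),
        "b_mean_gain": mean(gains_b),
        "ties": len(ap_a) - len(gains_a) - len(gains_b),
    }


# Toy per-query AP values (hypothetical, not taken from the paper).
run_a = {1: 0.5, 2: 0.25, 3: 0.75, 4: 0.5}
run_b = {1: 0.25, 2: 0.5, 3: 0.25, 4: 0.5}
summary = pairwise_comparison(run_a, run_b)
```

In the figures quoted above, 59 + 85 = 144 and 68 + 81 = 149, so the remaining 6 and 1 of the 150 queries were presumably ties.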

It is particularly encouraging that KLDLCA is also generally better than 
the two other state-of-the-art QE methods, RM3 and Bolnew, on all measures 
and across all datasets. The only exceptions are: RM3 yields better P@10 
on TREC4, TREC5, ROBnew and TREC910 and superior MAP for TREC910,
while Bolnew outperforms KLDLCA on P@10 for TREC123, on MAP for ROBnew,
and on the number of relevant documents retrieved for ROBnew. 

KLDRM3. KLDRM3 also yields better MAP than either KLD or RM3 on 
all collections (but neither difference is statistically significant for TREC4). It 
is also better than the other individual QE methods (LCAnew and Bolnew) 
on all corpora except ROBnew, where Bolnew outperforms KLDRM3. This
method is among the safest: only Bolnew yields improvements on marginally 
more queries for the TREC678 collection; on all other datasets, expansion by 
KLDRM3 improves performance on more queries than any other method. 

BolLCA, BolRM3. Both methods yield improvements (often significant) in 
MAP compared to all individual QE methods. Indeed, with a few exceptions, 
BolLCA is better than all individual QE methods for all the datasets and on 
all the measures. 

5.3 Discussion 

The results in the preceding section confirm our hypothesis that, on average, 
distribution and association based methods work well together. For queries such 
as 321 (Women in Parliaments), the combination works as expected. Both
LCAnew (AP = 0.2531) and KLD (AP = 0.2611) select and assign relatively 
high weights to specific names such as mashokw, jankowska, starkova, fedulova. 
When LCAnew is used to filter terms based on association information obtained 
from 50 documents, these terms are eliminated, and retrieval effectiveness goes 
up (AP = 0.3629). 
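
The filtering step described above can be sketched as a two-stage selection: rank candidates with a distribution-based score, then keep the candidates most strongly associated with the query. This is a simplified illustration under our own assumptions, not the authors' exact procedure; the function name, the candidate and output sizes, and the toy scores are all hypothetical, and how the surviving terms are weighted in the expanded query is not shown.

```python
def combine_expansion_terms(distribution_scores, association_scores,
                            num_candidates, num_terms):
    """Two-stage term selection.

    distribution_scores : dict term -> score from a distribution-based method
                          (for example KLD or Bol)
    association_scores  : dict term -> strength of association with the query
                          terms (for example an LCA-style score)
    """
    # Stage 1: candidate terms from the distribution-based method.
    candidates = sorted(distribution_scores,
                        key=lambda t: (-distribution_scores[t], t))[:num_candidates]
    # Stage 2: refine the candidates by association with the original query.
    refined = sorted(candidates,
                     key=lambda t: (-association_scores.get(t, 0.0), t))
    return refined[:num_terms]


# Toy scores (hypothetical): a rare name scores well on distribution alone
# but is weakly associated with the query terms.
dist = {"parliament": 0.9, "women": 0.8, "fedulova": 0.7, "election": 0.6, "the": 0.1}
assoc = {"parliament": 0.8, "women": 0.9, "election": 0.5, "fedulova": 0.05}
selected = combine_expansion_terms(dist, assoc, num_candidates=4, num_terms=3)
```

In this toy run the name survives the first stage but is dropped by the second, which mirrors the behaviour reported for query 321.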

More interesting are queries where the combination fails. Query 350 (Health 
and Computer Terminals) is one such example for which LCAnew (AP = 0.5911) 
and KLD (AP = 0.4512) both do reasonably well, but AP drops to 0.4007 for 
KLDLCA. For this particular query, filtering candidate terms using association
information results in the elimination of a number of good expansion terms. 

Unfortunately, no general pattern seems to be discernible for such queries 
where a combination is inferior to either or both of its ingredients.

6 Conclusion



In this study, our objective was to combine distribution based and association 
based query expansion methods. Using a number of standard test collections, we 
have shown that distribution based QE can be improved by using an association 
based method to refine term selection. The proposed combination gives better 
results than each individual method, as well as other state-of-the-art approaches. 

En route to this goal, we also proposed some modifications to a few well- 
known QE methods which lead to improved performance. This may be regarded 
as an additional contribution of this paper. 

In future work, we intend to do a more comprehensive study by investigating 
other combinations of QE methods. 